Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use unique string IDs #142

Merged
merged 96 commits into from
Jul 3, 2023
Merged

Use unique string IDs #142

merged 96 commits into from
Jul 3, 2023

Conversation

CunliangGeng
Copy link
Member

This is a very big PR with a lot of changes. It's very important to keep in mind that the purpose is to use unique string ID for scoring.

Brief background:

  • GCF, Spectrum and MolecularFamily objects have integer and/or string IDs:
    • GCF.id and GCF.gcf_id
    • Spectrum.id and Spectrum.spectrum_id
    • MolecularFamily.id and MolecularFamily.family_id
  • The string ID might not exist and are also not really used.
  • The integer IDs (*.id) are used for scoring part. But it caused a lot trouble because
    • The value of these IDs are not unique and can change when the order of the objects is changed
      • e.g. in [gcf1, gcf2] gcf1.id is 0; but [gcf2, gcf1] gcf1.id will be 1.
    • Scoring code use these IDs as the index to get the value from numpy array. However when the order of the objects is changed (the array does not change), it's totally wrong to use the new ID to select value from the array.

The solution is to only use string ID (unique) and remove the changeable integer ID.

Major Changes

  • change GCF.gcf_id, Specrum.spectrum_id and MolecularFamily.family_id to string ID

  • use string ID for scoring (DataLinks, LinkFinder, MetcalfScoring)

    • replace np.array with pandas.DataFrame (use string ID as index/column names) to store score data
    • update logics of code to use data frame and to use string ID to select values
  • remove the assignment of integer ID (*.id)

  • remove the use of integer ID (*.id) in StrainCollection

  • add docstrings and unit tests for scoring related classes

  • add TODO comments to files for further refactoring

Important but not involved in this PR

  • make sure all string ID of all objects are unique
  • remove *.id from all objects

The following BGC attributes are updated:
- product_prediction
- mibig_bgc_class
- smiles
to figure out how `spectrum_id` is set
@CunliangGeng
Copy link
Member Author

CunliangGeng commented May 15, 2023

Ready for review ;-)
You could ignore the CI tests since there is an error in BigScape dependency and I have requested its maintainer to fix it.
All the unit tests have passed locally.

Copy link
Collaborator

@hechth hechth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work! I'm just a bit confused by the testing of NPLinker scores and Metcalf scores - it seems like some of the tests are duplicated, or almost at least?

Comment on lines -248 to +249
strains.remove(bgc.strain)
if bgc.strain is not None:
strains.remove(bgc.strain)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the scenario where the strain object of the bgc is None?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The strain is None by default, so I guess this happens in all cases in which the strain is not set through the strain setter method

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Giulia is right. The strain might be None when BGC instances are not created using BGC Loaders (in the loaders, the strain will be set).

src/nplinker/metabolomics/metabolomics.py Outdated Show resolved Hide resolved
src/nplinker/metabolomics/singleton_family.py Outdated Show resolved Hide resolved
src/nplinker/nplinker.py Show resolved Hide resolved
src/nplinker/scoring/linking/data_links.py Show resolved Hide resolved
src/nplinker/scoring/linking/link_finder.py Outdated Show resolved Hide resolved
src/nplinker/scoring/linking/link_finder.py Outdated Show resolved Hide resolved
src/nplinker/scoring/linking/link_finder.py Show resolved Hide resolved
tests/scoring/test_metcalf_scoring.py Outdated Show resolved Hide resolved
Copy link
Contributor

@gcroci2 gcroci2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a huge work! Great :D

I added few minor comments, and I checked the code starting from the tests and trying then to dive deep into the internals. Since I don't know those parts of the package well, I am missing something here and there but overall I got the scope of the PR and I tried to check stuff like type hinting, doc strings, and comments. As I already mentioned, maybe next time it would be good to go through a PR like this together before reviewing it.

src/nplinker/nplinker.py Outdated Show resolved Hide resolved
tests/test_nplinker_local.py Show resolved Hide resolved
tests/test_strain_collection.py Show resolved Hide resolved
Comment on lines -248 to +249
strains.remove(bgc.strain)
if bgc.strain is not None:
strains.remove(bgc.strain)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The strain is None by default, so I guess this happens in all cases in which the strain is not set through the strain setter method

src/nplinker/metabolomics/spectrum.py Show resolved Hide resolved
@CunliangGeng
Copy link
Member Author

Awesome work! I'm just a bit confused by the testing of NPLinker scores and Metcalf scores - it seems like some of the tests are duplicated, or almost at least?

One is testing NPLinker object, another is testing Metcalf object. Though some tests look similar, they are serving different purposes and testing different objects.

@CunliangGeng CunliangGeng linked an issue Jun 21, 2023 that may be closed by this pull request
3 tasks
src/nplinker/metabolomics/singleton_family.py Show resolved Hide resolved
src/nplinker/scoring/metcalf_scoring.py Outdated Show resolved Hide resolved
@CunliangGeng CunliangGeng merged commit bd35f65 into dev Jul 3, 2023
0 of 2 checks passed
@CunliangGeng CunliangGeng deleted the use_unique_string_id branch July 4, 2023 08:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants